Abstract


We study the classic online learning problem of predicting with expert advice, and propose a truly 
parameter-free and adaptive algorithm that achieves several objectives simultaneously without using any 
prior information. The main component of this work is an improved version of the NormalHedge.DT 
algorithm [Luo and Schapire, 2014], called AdaNormalHedge. On one hand, this new algorithm ensures 
small regret when the competitor has small loss and almost constant regret when the losses are stochastic.
On the other hand, the algorithm is able to compete with any convex combination of the experts
simultaneously, with a regret in terms of the relative entropy of the prior and the competitor. This resolves
an open problem proposed by Chaudhuri et al. [2009] and Chernov and Vovk [2010]. Moreover,
we extend the results to the sleeping expert setting and provide two applications to illustrate the power 
of AdaNormalHedge: 1) competing with time-varying unknown competitors and 2) predicting almost as 
well as the best pruning tree. Our results on these applications significantly improve previous work from 
different aspects, and a special case of the first application resolves another open problem proposed by 
Warmuth and Koolen [2014] on whether one can simultaneously achieve optimal shifting regret for both 
adversarial and stochastic losses. 

1 Introduction 

The problem of predicting with expert advice was pioneered by Littlestone and Warmuth [1994], Freund
and Schapire [1997], Cesa-Bianchi et al. [1997], Vovk [1998] and others two decades ago. Roughly
speaking, in this problem, a player needs to decide a distribution over a set of experts on each round, and 
then an adversary decides and reveals the loss for each expert. The player’s loss for this round is the expected 
loss of the experts with respect to the distribution that he chose, and his goal is to have a total loss that is 
not much worse than any single expert, or more generally, any fixed and unknown convex combination of 
experts. 

Beyond this classic goal, various more difficult objectives for this problem were studied in recent years,
such as: learning with an unknown number of experts and competing with all but the top small fraction of
experts [Chaudhuri et al., 2009, Chernov and Vovk, 2010]; competing with a sequence of different combinations
of the experts [Herbster and Warmuth, 2001, Cesa-Bianchi et al., 2012]; learning with experts who
provide confidence-rated advice [Blum and Mansour, 2007]; and achieving much smaller regret when the 
problem is “easy” while still ensuring worst-case robustness [de Rooij et al., 2014, Van Erven et al., 2014, 
Gaillard et al., 2014]. Different algorithms were proposed separately to solve these problems to some extent. 
In this work, we essentially provide one single parameter-free algorithm that achieves all these goals with 
absolutely no prior information and significantly improved results in some cases. 

Our algorithm is a variant of Chaudhuri et al. [2009]'s NormalHedge algorithm, and more specifically
is an improved version of NormalHedge.DT [Luo and Schapire, 2014]. We call it Adaptive NormalHedge
(or AdaNormalHedge for short). NormalHedge and NormalHedge.DT provide guarantees for the so-called
$\epsilon$-quantile regret simultaneously for any $\epsilon$, which essentially corresponds to competing with a uniform
distribution over the top $\epsilon$-fraction of experts. Our new algorithm improves NormalHedge.DT in two respects
(Section 3):

1. AdaNormalHedge can compete with not just the competitor of the specific form mentioned above, but 
indeed any unknown fixed competitor simultaneously, with a regret in terms of the relative entropy 
between the competitor and the player’s prior belief of the experts. 

2. AdaNormalHedge ensures a new regret bound in terms of the cumulative magnitude of the instantaneous
regrets, which is always at most the bound for NormalHedge.DT (or NormalHedge). Moreover,
the power of this new form of regret is almost the same as the second order bound introduced in a
recent work by Gaillard et al. [2014]. Specifically, it implies 1) a small regret when the loss of the
competitor is small and 2) an almost constant regret when the losses are generated randomly with a
gap in expectation.

Our results resolve the open problem asked in Chaudhuri et al. [2009] and Chernov and Vovk [2010] on
whether a better $\epsilon$-quantile regret in terms of the loss of the expert instead of the horizon can be achieved.
In fact, our results are even better and more general.

AdaNormalHedge is a simple and truly parameter-free algorithm. Indeed, it does not even need to know 
the number of experts in some sense. To illustrate this idea, in Section 4 we extend the algorithm and results 
to a setting where experts provide confidence-rated advice [Blum and Mansour, 2007]. We then focus on a
special case of this setting called the sleeping expert problem [Blum, 1997, Freund et al., 1997], where the 
number of “awake” experts is dynamically changing and the total number of underlying experts is indeed 
unknown. AdaNormalHedge is thus a very suitable algorithm for this problem. To show the power of all 
the abovementioned properties of AdaNormalHedge, we study the following two examples of the sleeping 
expert problem and use AdaNormalHedge to significantly improve previous work. 

The first example is adaptive regret, that is, regret on any time interval, introduced by Hazan and
Seshadhri [2007]. This can be reduced to a sleeping expert problem by adding a new copy of each original
expert on each round [Freund et al., 1997, Koolen et al., 2012]. Thus, the total number of sleeping experts is
not fixed. When some information on this interval is known (such as the length, the loss of the competitor on
this interval, etc.), several algorithms achieve optimal regret [Hazan and Seshadhri, 2007, Cesa-Bianchi et al.,
2012]. However, when no prior information is available, all previous work gives suboptimal bounds. We
apply AdaNormalHedge to this problem. The resulting algorithm, which we call AdaNormalHedge.TV,
enjoys the optimal adaptive regret in not only the adversarial case but also the stochastic case due to the 
properties of AdaNormalHedge. 

We then extend the results to the problem of tracking the best experts where the player needs to compete 
with the best partition of the whole process and the best experts on each of these partitions [Herbster and 
Warmuth, 1995, Bousquet and Warmuth, 2003]. This resolves one of the open problems in Warmuth and
Koolen [2014] on whether a single algorithm can achieve optimal shifting regret for both adversarial and 
stochastic losses. Note that although recent work by Sani et al. [2014] also solves this open problem in some 
sense, their method requires knowing the number of partitions and other information ahead of time and also 
gives a worse bound for stochastic losses, while AdaNormalHedge.TV is completely parameter-free and 
gives optimal bounds. 

We finally consider the most general case where the competitor varies over time with no constraints,
which subsumes the previous two examples (adaptive regret and shifting regret). This problem was introduced
in Herbster and Warmuth [2001] and later generalized by Cesa-Bianchi et al. [2012]. Their algorithm
(fixed share) also requires knowing some information on the sequence of competitors to optimally tune
parameters. We avoid this issue by showing that while this problem seems more general and difficult, it is in
fact equivalent to its special case: achieving adaptive regret. This equivalence theorem is independent of
the concrete algorithms and may be of independent interest. Applying this result, we show that without any 
parameter tuning, AdaNormalHedge.TV automatically achieves a bound comparable to the one achieved by 
the optimally tuned fixed share algorithm when competing with time-varying competitors. 

Concrete results and detailed comparisons on this first example can be found in Section 5. To sum 
up, AdaNormalHedge.TV is an algorithm that is simultaneously adaptive in the number of experts, the 
competitors and the way the losses are generated.

The second example we provide is predicting almost as well as the best pruning tree [Helmbold and 
Schapire, 1997], which was also shown to be reducible to a sleeping expert problem [Freund et al., 1997]. 
Previous work either only considered the log loss setting, or assumed prior information on the best pruning 
tree is known. Using AdaNormalHedge, we again provide better or comparable bounds without knowing 
any prior information. In fact, due to the adaptivity of AdaNormalHedge in the number of experts, our regret 
bound depends on the total number of distinct traversed edges so far, instead of the total number of edges of 
the decision tree as in Freund et al. [1997] which could be exponentially larger. Concrete comparisons can 
be found in Section 6. 

Related work. While competing with any unknown competitor simultaneously is relatively easy in the log
loss setting [Littlestone and Warmuth, 1994, Adamskiy et al., 2012, Koolen et al., 2012], it is much harder
in the bounded loss setting studied here. The well-known exponential weights algorithm gives the optimal
results only when the learning rate is optimally tuned in terms of the competitor [Freund and Schapire, 1999].
Chernov and Vovk [2010] also studied $\epsilon$-quantile regret, but no concrete algorithm was provided. Several
works consider competing with unknown competitors in a different unconstrained linear optimization setting
[Streeter and McMahan, 2012, Orabona, 2013, McMahan and Orabona, 2014, Orabona, 2014]. Jadbabaie
et al. [2015] studied general adaptive online learning algorithms against time-varying competitors, but with 
different and incomparable measurement of the hardness of the problem. As far as we know, none of the 
existing algorithms enjoys all the nice properties discussed in this work at the same time as our algorithms 
do. 

2 The Expert Problem and NormalHedge.DT 

In the expert problem, on each round $t = 1, \ldots, T$: the player first chooses a distribution $\mathbf{p}_t$ over $N$ experts,
then the adversary decides each expert’s loss $\ell_{t,i} \in [0,1]$, and reveals these losses to the player. At the end
of this round, the player suffers the weighted average loss $\hat{\ell}_t = \mathbf{p}_t \cdot \boldsymbol{\ell}_t$ with $\boldsymbol{\ell}_t = (\ell_{t,1}, \ldots, \ell_{t,N})$. We denote
the instantaneous regret to expert $i$ on round $t$ by $r_{t,i} = \hat{\ell}_t - \ell_{t,i}$, the cumulative regret by $R_{t,i} = \sum_{\tau=1}^{t} r_{\tau,i}$,
and the cumulative loss by $L_{t,i} = \sum_{\tau=1}^{t} \ell_{\tau,i}$. Throughout the paper, a bold letter denotes a vector with
$N$ corresponding coordinates. For example, $\mathbf{r}_t$, $\mathbf{R}_t$ and $\mathbf{L}_t$ represent $(r_{t,1}, \ldots, r_{t,N})$, $(R_{t,1}, \ldots, R_{t,N})$ and
$(L_{t,1}, \ldots, L_{t,N})$ respectively.

Usually, the goal of the player is to minimize the regret to the best expert, that is, $\max_i R_{T,i}$. Here
we consider a more general case where the player wants to minimize the regret to an arbitrary convex
combination of experts: $R_T(\mathbf{u}) = \sum_{t=1}^{T} \mathbf{u} \cdot \mathbf{r}_t$ where the competitor $\mathbf{u}$ is a fixed unknown distribution over
the experts. In other words, this regret measures the difference between the player’s loss and the loss that he
would have suffered if he used a constant strategy $\mathbf{u}$ all the time. Clearly, $R_T(\mathbf{u})$ can be written as $\mathbf{u} \cdot \mathbf{R}_T$
and can then be upper bounded appropriately by a bound on each $R_{T,i}$ (for example, $\max_i R_{T,i}$). However,
our goal is to get a better and more refined bound on $R_T(\mathbf{u})$ that depends on $\mathbf{u}$. More importantly, we aim
to achieve this without knowing the competitor $\mathbf{u}$ ahead of time. When it is clear from the context, we drop
the subscript $T$ in $R_T(\mathbf{u})$.

In fact, in Section 5, we will consider an even more general notion of regret introduced in Herbster and
Warmuth [2001], where we allow the competitor to vary over time and to have different scales. Specifically,
let $\mathbf{u}_1, \ldots, \mathbf{u}_T$ be $T$ different vectors with $N$ nonnegative coordinates (denoted by $\mathbf{u}_{1:T}$). Then the regret of
the player to this sequence of competitors is $R(\mathbf{u}_{1:T}) = \sum_{t=1}^{T} \mathbf{u}_t \cdot \mathbf{r}_t$. If all these competitors are distributions
(which they are not required to be), then this regret captures a very natural and general concept of comparing
the player’s strategy to any other strategy. Again, we are interested in developing low-regret algorithms that
do not need to know any information of this sequence of competitors beforehand.

We briefly describe a recent algorithm for the expert problem, NormalHedge.DT [Luo and Schapire,
2014] (a variant of NormalHedge [Chaudhuri et al., 2009]), before we introduce our new improved variants.

On round $t$, NormalHedge.DT sets $p_{t,i} \propto \exp\left(\frac{[R_{t-1,i}+1]_+^2}{3t}\right) - \exp\left(\frac{[R_{t-1,i}-1]_+^2}{3t}\right)$, where $[x]_+ = \max\{0, x\}$.
Let $\epsilon \in (0,1]$ and competitor $\mathbf{u}^*_\epsilon$ be a distribution that puts all the mass on the $\lceil N\epsilon \rceil$-th best expert, that is,
the one that ranks $\lceil N\epsilon \rceil$ among all experts according to their total loss $L_{T,i}$ from the smallest to the largest.
Then the regret guarantee for NormalHedge.DT states $R(\mathbf{u}^*_\epsilon) \le O\left(\sqrt{T \ln\left(\frac{\ln T}{\epsilon}\right)}\right)$ simultaneously for all $\epsilon$,
which means the algorithm suffers at most this amount of regret for all but an $\epsilon$ fraction of the experts. Note
that this bound does not depend on $N$ at all. This is the first concrete algorithm with this kind of adaptive
property (the original NormalHedge [Chaudhuri et al., 2009] still has a weak dependence on $N$). In fact,
as we will show later, one can even extend the results to any competitor $\mathbf{u}$. Moreover, we will improve
NormalHedge.DT so that it has a much smaller regret when the problem is “easy” in some sense.

Notation. We use $[N]$ to denote the set $\{1, \ldots, N\}$, $\Delta_N$ to denote the simplex of all distributions over $[N]$,
and $\mathrm{RE}(\cdot \,\|\, \cdot)$ to denote the relative entropy between two distributions. Also define $\tilde{L}_{T,i} = \sum_{t=1}^{T} [\ell_{t,i} - \hat{\ell}_t]_+$.
Many bounds in this work will be in terms of $\tilde{L}_{T,i}$, which is always at most $L_{T,i}$ since trivially $[\ell_{t,i} - \hat{\ell}_t]_+ \le \ell_{t,i}$.
We consider “log log” terms to be nearly constant, and use $\hat{O}(\cdot)$ notation to hide these terms. Indeed, as
pointed out by Chernov and Vovk [2010], $\ln \ln x$ is smaller than 4 even when $x$ is as large as the age of the
universe expressed in microseconds ($\approx 4.3 \times 10^{17}$).
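
As a quick numerical check of this remark, the following Python snippet evaluates $\ln \ln x$ at the quoted value $x \approx 4.3 \times 10^{17}$. The variable names are illustrative only.

```python
import math

# Age of the universe in microseconds, as quoted in the text.
x = 4.3e17

# Iterated natural logarithm ln(ln(x)).
log_log_x = math.log(math.log(x))

print(f"ln ln x = {log_log_x:.2f}")  # roughly 3.70, below 4
```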


3 A New Algorithm: AdaNormalHedge 


We start by writing NormalHedge.DT in a general form. We define the potential function $\Phi(R, C) = \exp\left(\frac{[R]_+^2}{3C}\right)$
with $\Phi(0, 0)$ defined to be 1, and also a weight function with respect to this potential:

$$w(R, C) = \frac{1}{2}\left(\Phi(R+1, C+1) - \Phi(R-1, C+1)\right).$$

Then the prediction of NormalHedge.DT is simply to set $p_{t,i}$ to be proportional to $w(R_{t-1,i}, C_{t-1})$ where
$C_t = t$ for all $t$. Note that $C_t$ is closely related to the regret. In fact, the regret is roughly of order $\sqrt{C_T}$
(ignoring the log term). Therefore, in order to get an expert-wise and more refined bound, we replace $C_t$ by
$C_{t,i}$ for each expert so that it captures some useful information for each expert $i$. There are several possible
choices for $C_{t,i}$ (discussed at the end of Appendix A), but for now we focus on the one used in our new
algorithm: $C_{t,i} = \sum_{\tau=1}^{t} |r_{\tau,i}|$, that is, the cumulative magnitude of the instantaneous regrets up to time $t$.

Algorithm 1 AdaNormalHedge

Input: A prior distribution $\mathbf{q} \in \Delta_N$ over experts (uniform if no prior available).
Initialize: $\forall i \in [N]$, $R_{0,i} = 0$, $C_{0,i} = 0$.
for $t = 1$ to $T$ do
    Predict$^1$ $p_{t,i} \propto q_i \, w(R_{t-1,i}, C_{t-1,i})$.
    Adversary reveals loss vector $\boldsymbol{\ell}_t$ and player suffers loss $\hat{\ell}_t = \mathbf{p}_t \cdot \boldsymbol{\ell}_t$.
    Set $\forall i \in [N]$, $r_{t,i} = \hat{\ell}_t - \ell_{t,i}$, $R_{t,i} = R_{t-1,i} + r_{t,i}$, $C_{t,i} = C_{t-1,i} + |r_{t,i}|$.
end for

We call this algorithm AdaNormalHedge and summarize it in Algorithm 1. Note that we even allow the 
player to have a prior distribution q over the experts, which will be useful in some applications as we will 
see in Section 5. The theoretical guarantee of AdaNormalHedge is stated below. 
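
The update rule of Algorithm 1 can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the class name `AdaNormalHedge` and the methods `predict` and `update` are hypothetical, the uniform fallback is just one way to "predict arbitrarily" when all weights vanish, and no guard against floating-point overflow is included.

```python
import math

def potential(R, C):
    """Phi(R, C) = exp([R]_+^2 / (3C)), with Phi(0, 0) defined to be 1."""
    if C == 0:
        return 1.0
    r_plus = max(0.0, R)
    return math.exp(r_plus * r_plus / (3.0 * C))

def weight(R, C):
    """w(R, C) = (Phi(R+1, C+1) - Phi(R-1, C+1)) / 2."""
    return 0.5 * (potential(R + 1.0, C + 1.0) - potential(R - 1.0, C + 1.0))

class AdaNormalHedge:
    def __init__(self, n_experts, prior=None):
        # Uniform prior q if none is supplied.
        self.q = list(prior) if prior is not None else [1.0 / n_experts] * n_experts
        self.R = [0.0] * n_experts  # cumulative regrets R_{t,i}
        self.C = [0.0] * n_experts  # cumulative |r_{t,i}|

    def predict(self):
        raw = [qi * weight(Ri, Ci) for qi, Ri, Ci in zip(self.q, self.R, self.C)]
        total = sum(raw)
        if total <= 0.0:
            # All weights are zero: any prediction is allowed (assumed: uniform).
            return [1.0 / len(raw)] * len(raw)
        return [x / total for x in raw]

    def update(self, p, losses):
        # Player's loss is the expected loss under the chosen distribution p.
        player_loss = sum(pi * li for pi, li in zip(p, losses))
        for i, li in enumerate(losses):
            r = player_loss - li  # instantaneous regret to expert i
            self.R[i] += r
            self.C[i] += abs(r)
        return player_loss
```

A typical round calls `p = alg.predict()`, observes the loss vector, and then calls `alg.update(p, losses)`. Note that no learning rate or horizon appears anywhere in the code.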

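Theorem 1. The regret of AdaNormalHedge to any competitor $\mathbf{u} \in \Delta_N$ is bounded as follows:

$$R(\mathbf{u}) \le \sqrt{3(\mathbf{u} \cdot \mathbf{C}_T)\left(\mathrm{RE}(\mathbf{u} \,\|\, \mathbf{q}) + \ln B + \ln(1 + \ln N)\right)} = \hat{O}\left(\sqrt{(\mathbf{u} \cdot \mathbf{C}_T)\,\mathrm{RE}(\mathbf{u} \,\|\, \mathbf{q})}\right), \qquad (1)$$

where $\mathbf{C}_T = (C_{T,1}, \ldots, C_{T,N})$ and $B = 1 + \frac{3}{2} \sum_i q_i \left(1 + \ln(1 + C_{T,i})\right) \le \frac{5}{2} + \frac{3}{2} \ln(1 + T)$. Moreover, if $\mathbf{u}$ is
a uniform distribution over a subset of $[N]$, then the regret can be improved to

$$R(\mathbf{u}) \le \sqrt{3(\mathbf{u} \cdot \mathbf{C}_T)\left(\mathrm{RE}(\mathbf{u} \,\|\, \mathbf{q}) + \ln B + 1\right)}. \qquad (2)$$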

Before we prove this theorem (see sketch at the end of this section and complete proof in Appendix A),
we discuss some implications of the regret bounds and why they are interesting. First of all, the relative
entropy term $\mathrm{RE}(\mathbf{u} \,\|\, \mathbf{q})$ captures how close the player’s prior is to the competitor. A bound in terms of
$\mathrm{RE}(\mathbf{u} \,\|\, \mathbf{q})$ can be obtained, for example, using the classic exponential weights algorithm but requires carefully
tuning the learning rate as a function of $\mathbf{u}$. Without knowing $\mathbf{u}$, as far as we know, AdaNormalHedge
is the only algorithm that can achieve this.$^2$

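On the other hand, if $\mathbf{q}$ is a uniform distribution, then using bound (2) and the fact $C_{T,i} \le T$, we get an
$\epsilon$-quantile regret bound similar to the one of NormalHedge.DT: $R(\mathbf{u}^*_\epsilon) \le R(\mathbf{u}_{S_\epsilon}) \le \sqrt{3T\left(\ln\left(\frac{1}{\epsilon}\right) + \ln B + 1\right)}$
where $\mathbf{u}_{S_\epsilon}$ is uniform over the top $\lceil N\epsilon \rceil$ experts, in terms of their total loss $L_{T,i}$.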
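
To make the quantities in Theorem 1 concrete, the following Python sketch evaluates the right-hand side of bound (1) for a given competitor, prior, and vector $\mathbf{C}_T$. The function names `relative_entropy` and `theorem1_bound` are hypothetical, and the constants $\frac{3}{2}$ in $B$ are taken as stated in Theorem 1. With a uniform prior over $N$ experts and a competitor uniform over $k$ of them, the relative entropy equals $\ln(N/k)$, which is where the $\ln(1/\epsilon)$ term in the quantile bound comes from.

```python
import math

def relative_entropy(u, q):
    """RE(u || q) = sum_i u_i ln(u_i / q_i), using the convention 0 ln 0 = 0."""
    return sum(ui * math.log(ui / qi) for ui, qi in zip(u, q) if ui > 0)

def theorem1_bound(u, q, C):
    """Right-hand side of bound (1) for competitor u, prior q and vector C_T."""
    n = len(q)
    # B = 1 + (3/2) * sum_i q_i (1 + ln(1 + C_{T,i})), as assumed from Theorem 1.
    B = 1.0 + 1.5 * sum(qi * (1.0 + math.log(1.0 + Ci)) for qi, Ci in zip(q, C))
    u_dot_C = sum(ui * Ci for ui, Ci in zip(u, C))
    extra = math.log(B) + math.log(1.0 + math.log(n))
    return math.sqrt(3.0 * u_dot_C * (relative_entropy(u, q) + extra))

# Example: N = 8 experts, uniform prior, competitor uniform over the top 2.
N = 8
q = [1.0 / N] * N
u = [0.5, 0.5] + [0.0] * (N - 2)
C = [100.0] * N  # illustrative values of C_{T,i}
bound = theorem1_bound(u, q, C)
```

The bound scales with $\sqrt{\mathbf{u} \cdot \mathbf{C}_T}$ rather than $\sqrt{T}$, so it shrinks whenever the instantaneous regrets to the competitor are small.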

However, the power of a bound in terms of Ct is far more than this. Gaillard et al. [2014] introduced 
a new second order bound that implies much smaller regret when the problem is easy. It turns out that our 
seemingly weaker first order bound is also enough to get the exact same results! We state these implications 
in the following theorem which is essentially a restatement of Theorems 9 and 11 of Gaillard et al. [2014] 
with weaker conditions. 

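Theorem 2. Suppose an expert algorithm guarantees $R(\mathbf{u}) \le \sqrt{(\mathbf{u} \cdot \mathbf{C}_T) A(\mathbf{u})}$ where $A(\mathbf{u})$ is some function
of $\mathbf{u}$. Then it also satisfies the following:

1. Recall $L_{T,i} = \sum_{t=1}^{T} \ell_{t,i}$ and $\tilde{L}_{T,i} = \sum_{t=1}^{T} [\ell_{t,i} - \hat{\ell}_t]_+$. We have

$$R(\mathbf{u}) \le \sqrt{2(\mathbf{u} \cdot \tilde{\mathbf{L}}_T) A(\mathbf{u})} + A(\mathbf{u}) \le \sqrt{2(\mathbf{u} \cdot \mathbf{L}_T) A(\mathbf{u})} + A(\mathbf{u}).$$

2. Suppose the loss vectors $\boldsymbol{\ell}_t$ are independent random variables and there exists an $i^*$ and some $\alpha \in (0,1]$
such that $\mathbb{E}[\ell_{t,i} - \ell_{t,i^*}] \ge \alpha$ for any $t$ and $i \ne i^*$. Let $\mathbf{e}_{i^*}$ be a distribution that puts all the
mass on expert $i^*$. Then we have $\mathbb{E}[R_{T,i^*}] \le \frac{A(\mathbf{e}_{i^*})}{\alpha}$, and with probability at least $1 - \delta$, $R_{T,i^*}$ satisfies a
bound of the same form with an extra confidence term.

$^1$ If $p_{t,i} \propto 0$ happens for all $i$, predict arbitrarily.

$^2$ In fact, one can also derive similar bounds for NormalHedge and NormalHedge.DT using our analysis. See discussion at the
end of Appendix A.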

The proof of Theorem 2 is based on the same idea as in Gaillard et al. [2014], and is included in Appendix B
for completeness. For AdaNormalHedge, the term $A(\mathbf{u})$ is $3(\mathrm{RE}(\mathbf{u} \,\|\, \mathbf{q}) + \ln B + \ln(1 + \ln N))$
in general (or smaller for special $\mathbf{u}$ as stated in Theorem 1). Applying Theorem 2 we have
$R(\mathbf{u}) = \hat{O}\left(\sqrt{(\mathbf{u} \cdot \tilde{\mathbf{L}}_T)\,\mathrm{RE}(\mathbf{u} \,\|\, \mathbf{q})}\right)$.$^3$ Specifically, if $\mathbf{q}$ is uniform and assuming without loss of generality that
$L_{T,1} \le \cdots \le L_{T,N}$, then by a similar argument, we have for AdaNormalHedge, $R(\mathbf{u}^*_\epsilon) \le \hat{O}\left(\sqrt{L_{T,\lceil N\epsilon \rceil} \ln\left(\frac{1}{\epsilon}\right)}\right)$
for any $\epsilon$. This answers the open question (in the affirmative) asked by Chaudhuri et al. [2009] and Chernov
and Vovk [2010] on whether an improvement for small loss can be obtained for $\epsilon$-quantile regret without
knowing $\epsilon$.

On the other hand, when we are in a stochastic setting as stated in Theorem 2, AdaNormalHedge ensures
$R_{T,i^*} \le \hat{O}\left(\frac{1}{\alpha} \ln\left(\frac{1}{q_{i^*}}\right)\right)$ in expectation (or with high probability with an extra confidence term), which does
not grow with $T$. Therefore, the new regret bound in terms of $\mathbf{C}_T$ actually leads to significant improvements
compared to NormalHedge.DT.

Comparison to Adapt-ML-Prod [Gaillard et al., 2014]. Adapt-ML-Prod enjoys a second order bound
in terms of $\sum_{t=1}^{T} r_{t,i}^2$, which is always at most the term $\sum_{t=1}^{T} |r_{t,i}|$ appearing in our bounds.$^4$ However, on
one hand, as discussed above, these two bounds have the same improvements when the problem is easy in
several senses; on the other hand, Adapt-ML-Prod does not provide a bound in terms of $\mathrm{RE}(\mathbf{u} \,\|\, \mathbf{q})$ for
an unknown $\mathbf{u}$. In fact, as discussed at the end of Section A.3 of Gaillard et al. [2014], Adapt-ML-Prod
cannot improve by exploiting a good prior $\mathbf{q}$ (or at least its current analysis cannot). Specifically, while
the regret for AdaNormalHedge does not have an explicit dependence on $N$ and is much smaller when the
prior $\mathbf{q}$ is close to the competitor $\mathbf{u}$, the regret for Adapt-ML-Prod always has a $\ln N$ multiplicative term for
$\sum_{t=1}^{T} r_{t,i}^2$, which means even a good prior results in the same regret as a uniform prior! More advantages of
AdaNormalHedge over Adapt-ML-Prod will be discussed in concrete examples in following sections.

Proof sketch of Theorem 1. The analysis of NormalHedge.DT is based on the idea of converting the
expert problem into a drifting game [Schapire, 2001, Luo and Schapire, 2014]. Here, we extract and
simplify the key idea of their proof and also improve it to form our analysis. The main idea is to show
that the weighted sum of potentials does not increase much on each round using an improved version of
Lemma 2 of Luo and Schapire [2014]. In fact, we show that the final potential $\sum_{i=1}^{N} q_i \Phi(R_{T,i}, C_{T,i})$
is bounded by exactly $B$ (defined in Theorem 1). From this, assuming without loss of generality that
$q_1 \Phi(R_{T,1}, C_{T,1}) \ge \cdots \ge q_N \Phi(R_{T,N}, C_{T,N})$, we have $q_i \Phi(R_{T,i}, C_{T,i}) \le \frac{B}{i}$ for all $i$, which, by solving
for $R_{T,i}$, gives $R_{T,i} \le \sqrt{3 C_{T,i} \ln\left(\frac{B}{i q_i}\right)}$. Multiplying both sides by $u_i$, summing over $i \in [N]$ and applying
the Cauchy-Schwarz inequality, we arrive at $R(\mathbf{u}) \le \sqrt{3(\mathbf{u} \cdot \mathbf{C}_T)\left(D(\mathbf{u} \,\|\, \mathbf{q}) + \ln B\right)}$, where we define
$D(\mathbf{u} \,\|\, \mathbf{q}) = \sum_{i=1}^{N} u_i \ln\left(\frac{1}{i q_i}\right)$. It remains to show that $D(\mathbf{u} \,\|\, \mathbf{q})$ and $\mathrm{RE}(\mathbf{u} \,\|\, \mathbf{q})$ are close by standard
analysis and Stirling’s formula.

$^3$ We see $\hat{O}(\mathrm{RE}(\mathbf{u} \,\|\, \mathbf{q}))$ as a minor term and hide it in the big O notation, which is not completely rigorous but will ease the
presentation. The same applies to other first order bounds in this work.

$^4$ We briefly discuss the difficulty of getting a similar second order bound for our algorithm at the end of Appendix A.

4 Confidence-rated Advice and Sleeping Experts 

In this section, we generalize AdaNormalHedge to deal with experts that give confidence-rated advice, a
setting that subsumes many interesting applications as studied by Blum [1997] and Freund et al. [1997]. In
this general setting, on each round $t$, each expert first reports its confidence $I_{t,i} \in [0,1]$ for the current task.
The player then predicts $\mathbf{p}_t$ as usual with an extra yet natural restriction that if $I_{t,i} = 0$ then $p_{t,i} = 0$. That
is, the player has to ignore those experts who abstain from giving advice (by reporting zero confidence).
After that, the losses $\ell_{t,i}$ for those experts who did not abstain (i.e. $I_{t,i} \ne 0$) are revealed and the player still
suffers loss $\hat{\ell}_t = \mathbf{p}_t \cdot \boldsymbol{\ell}_t$. We redefine the instantaneous regret $r_{t,i}$ to be $I_{t,i}(\hat{\ell}_t - \ell_{t,i})$, that is, the difference
between the loss of the player and expert $i$ weighted by the confidence. The goal of the player is, as before,
to minimize cumulative regret to any competitor $\mathbf{u}$: $R(\mathbf{u}) = \sum_{t \le T} \mathbf{u} \cdot \mathbf{r}_t$. Clearly, the classic expert problem
that we have studied in previous sections is just a special case of this general setting with $I_{t,i} = 1$ for all $t$
and $i$.

Moreover, with this general form of $r_{t,i}$, AdaNormalHedge can be used to deal with this general setting
with only one simple change of scaling the weights by the confidence:

$$p_{t,i} \propto I_{t,i} \, q_i \, w(R_{t-1,i}, C_{t-1,i}), \qquad (3)$$

where $R_{t,i}$ and $C_{t,i}$ are still defined to be $\sum_{\tau=1}^{t} r_{\tau,i}$ and $\sum_{\tau=1}^{t} |r_{\tau,i}|$ respectively. The constraint
$I_{t,i} = 0 \Rightarrow p_{t,i} = 0$ is clearly satisfied. In fact, Algorithm 1 can be seen as a special case of this general form
of AdaNormalHedge with $I_{t,i} = 1$. Furthermore, the regret bounds in Theorem 1 still hold without any
changes, as summarized below (proof deferred to Appendix A).

Theorem 3. For the confidence-rated expert problem, regret bounds (1) and (2) still hold for general 
AdaNormalHedge (Eq. (3)). 

Previously, Gaillard et al. [2014] studied a general reduction from an expert algorithm to a confidence-rated
expert algorithm. Applying those results here gives the exact same algorithm and regret guarantee
mentioned above. However, we point out that the general reduction is not always applicable. Specifically,
it is invalid if there is an unknown number of experts in the confidence-rated setting (explained more in the
next paragraph) while the expert algorithm in the standard setting requires knowing the number of experts
as a parameter. This is indeed the case for most algorithms (including Adapt-ML-Prod and even the original
NormalHedge by Chaudhuri et al. [2009]). AdaNormalHedge naturally avoids this problem since it does 
not depend on N at all. 

Sleeping Experts. We are especially interested in the case when $I_{t,i} \in \{0,1\}$, also called the specialist/sleeping
expert problem, where $I_{t,i} = 0$ means that expert $i$ is “asleep” for round $t$ and not giving any
advice. This is a natural setting where the total number of experts is unknown ahead of time. Indeed, the
number of awake experts can be dynamically changing over time. An expert that has never appeared before 
should be thought of as being asleep for all previous rounds. 

AdaNormalHedge is a very suitable algorithm to deal with this case due to its independence of the total
number of experts. If an expert $i$ appears for the first time on round $t$, then by definition it will naturally
start with $R_{t-1,i} = 0$ and $C_{t-1,i} = 0$. Although we state the prior $\mathbf{q}$ as a distribution, which seems to
require knowing the total number of experts, it is not an issue algorithmically since $\mathbf{q}$ is only used to scale
the unnormalized weights (Eq. (3)). For example, if we want $\mathbf{q}$ to be a uniform distribution over $N$ experts
where $N$ is unknown beforehand, then to run AdaNormalHedge we can simply treat $q_i$ in Eq. (3) to be 1 for
all $i$, which clearly will not change the behavior of the algorithm anyway. In this case, if we let $N_t$ denote
the total number of distinct experts that have been seen up to time $t$ and the competitor $\mathbf{u}$ concentrates on
any of these experts, then the relative entropy term in the regret (up to time $t$) will be $\ln N_t$ (instead of $\ln N$),
which is changing over time.
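
The confidence-rated update of Eq. (3), combined with the trick of setting $q_i = 1$ for every expert, can be sketched as follows. This is an illustrative sketch under stated assumptions: the class name `SleepingAdaNormalHedge`, the dictionary-based interface, and the uniform fallback over non-abstaining experts are all hypothetical choices, not part of the paper. Experts are identified by arbitrary hashable ids, so new experts can appear at any time.

```python
import math

def potential(R, C):
    # Phi(R, C) = exp([R]_+^2 / (3C)); only called with C >= 1 below.
    r_plus = max(0.0, R)
    return math.exp(r_plus * r_plus / (3.0 * C))

def weight(R, C):
    # w(R, C) = (Phi(R+1, C+1) - Phi(R-1, C+1)) / 2
    return 0.5 * (potential(R + 1.0, C + 1.0) - potential(R - 1.0, C + 1.0))

class SleepingAdaNormalHedge:
    """Eq. (3) with q_i = 1 for every expert, so N need not be known."""

    def __init__(self):
        self.R = {}  # cumulative regret per expert id
        self.C = {}  # cumulative |instantaneous regret| per expert id

    def predict(self, confidence):
        """confidence maps expert id -> I_{t,i} in [0, 1]."""
        # Unseen experts implicitly start with R = C = 0.
        raw = {i: c * weight(self.R.get(i, 0.0), self.C.get(i, 0.0))
               for i, c in confidence.items()}
        total = sum(raw.values())
        if total <= 0.0:
            # All weights vanish: predict arbitrarily among non-abstaining experts.
            raw = {i: (1.0 if c > 0 else 0.0) for i, c in confidence.items()}
            total = sum(raw.values())
        return {i: (v / total if total > 0 else 0.0) for i, v in raw.items()}

    def update(self, confidence, p, losses):
        """losses maps expert id -> loss in [0, 1] for experts with I_{t,i} > 0."""
        player_loss = sum(p[i] * losses[i] for i in p if p[i] > 0)
        for i, c in confidence.items():
            if c > 0:
                r = c * (player_loss - losses[i])  # confidence-weighted regret
                self.R[i] = self.R.get(i, 0.0) + r
                self.C[i] = self.C.get(i, 0.0) + abs(r)
        return player_loss
```

An expert with confidence 0 receives probability 0 and is never updated, which matches the constraint that abstaining experts must be ignored.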

Using the adaptivity of AdaNormalHedge in both the number of experts and the competitor, we provide 
improved results for two instances of the sleeping expert problem in the next two sections. 

5 Time-Varying Competitors 

In this section, we study a more challenging goal of competing with time-varying competitors in the standard
expert setting (that is, each expert is always awake and again $r_{t,i} = \hat{\ell}_t - \ell_{t,i}$), which turns out to be reducible
to a sleeping expert problem. Results for this section are summarized in Table 1.

5.1 Special Cases: Adaptive Regret and Tracking the Best Expert 

We start from a special case: adaptive regret, introduced by Hazan and Seshadhri [2007] to better capture
changing environments. Formally, consider any time interval $t = t_1, \ldots, t_2$, and let $R_{[t_1,t_2],i} = \sum_{t=t_1}^{t_2} r_{t,i}$
be the regret to expert $i$ on this interval (similarly define $L_{[t_1,t_2],i} = \sum_{t=t_1}^{t_2} \ell_{t,i}$ and $C_{[t_1,t_2],i} = \sum_{t=t_1}^{t_2} |r_{t,i}|$).
The goal of the player is to obtain relatively small regret on any interval. Freund et al. [1997] essentially
introduced a way to reduce this problem to a sleeping expert problem, which was later improved by Adamskiy
et al. [2012]. Specifically, for every pair of time $t$ and expert $i$, we create a sleeping expert, denoted by $(t, i)$,
who is only awake after (and including) round $t$, and since then suffers the same loss as the original expert $i$.
So we have $Nt$ sleeping experts in total on round $t$. The prediction $p_{t,i}$ is set to be the sum of all the weights
of sleeping experts $(\tau, i)$ ($\tau = 1, \ldots, t$). It is clear that doing this ensures that the cumulative regret up to
time $t_2$ with respect to sleeping expert $(t_1, i)$ is exactly $R_{[t_1,t_2],i}$ in the original problem.

This is a sleeping expert problem for which AdaNormalHedge is very suitable, since the number of 
sleeping experts keeps increasing and the total number of experts is in fact unknown if the horizon T is 
unknown. Theorem 3 implies that the resulting algorithm gives the following adaptive regret: 

$$R_{[t_1,t_2],i} = \hat{O}\left(\sqrt{C_{[t_1,t_2],i} \ln\left(\frac{1}{q_{(t_1,i)}}\right)}\right) = \hat{O}\left(\sqrt{C_{[t_1,t_2],i} \ln(N t_1)}\right),$$

where $\mathbf{q}$ is a prior over the $N t_2$ experts and the last step is by setting the prior to be $q_{(t,i)} \propto 1/t^2$ for all $t$ and
$i$.$^5$ This prior is better than a simple uniform distribution, which leads to a term $\ln(N t_2)$ instead of $\ln(N t_1)$.
We call this algorithm AdaNormalHedge.TV.$^6$ To be concrete, on round $t$ AdaNormalHedge.TV predicts

$$p_{t,i} \propto \sum_{\tau=1}^{t} \frac{1}{\tau^2} \, w\left(R_{[\tau,t-1],i}, C_{[\tau,t-1],i}\right).$$

Again, Theorem 2 can be applied to get a more interpretable bound $\hat{O}\left(\sqrt{\tilde{L}_{[t_1,t_2],i} \ln(N t_1)}\right)$ where
$\tilde{L}_{[t_1,t_2],i} = \sum_{t=t_1}^{t_2} [\ell_{t,i} - \hat{\ell}_t]_+ \le L_{[t_1,t_2],i}$, and a much smaller bound $\hat{O}\left(\frac{\ln(N t_1)}{\alpha}\right)$ if the losses are stochastic
on interval $[t_1, t_2]$ in the sense stated in Theorem 2.


$^5$ Note that as discussed before, the fact that $t_2$ is unknown and thus $\mathbf{q}$ is unknown does not affect the algorithm.
$^6$ “TV” stands for “time-varying”.



One drawback of AdaNormalHedge.TV is that its time complexity per round is O(Nt) and the overall 
space is O(NT). However, the data streaming technique used in Hazan and Seshadhri [2007] can be directly 
applied here to reduce the time and space complexity to $O(N \ln t)$ and $O(N \ln T)$ respectively, with only
an extra multiplicative $O\left(\sqrt{\ln(t_2 - t_1)}\right)$ factor in the regret.
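
The naive version of AdaNormalHedge.TV, with $O(Nt)$ work on round $t$, can be sketched as follows. This is an illustrative sketch only: the class name `AdaNormalHedgeTV` and its methods are hypothetical, the prior is the unnormalized $1/\tau^2$ choice described above, and the data streaming speed-up is not implemented.

```python
import math

def potential(R, C):
    # Phi(R, C) = exp([R]_+^2 / (3C)); only called with C >= 1 below.
    r_plus = max(0.0, R)
    return math.exp(r_plus * r_plus / (3.0 * C))

def weight(R, C):
    # w(R, C) = (Phi(R+1, C+1) - Phi(R-1, C+1)) / 2
    return 0.5 * (potential(R + 1.0, C + 1.0) - potential(R - 1.0, C + 1.0))

class AdaNormalHedgeTV:
    def __init__(self, n_experts):
        self.n = n_experts
        self.R = []  # self.R[tau - 1][i] holds R_{[tau, t-1], i}
        self.C = []  # self.C[tau - 1][i] holds C_{[tau, t-1], i}

    def predict(self):
        t = len(self.R) + 1
        # Sleeping experts (t, i) wake up now with R = C = 0 and prior 1 / t^2.
        raw = [weight(0.0, 0.0) / (t * t)] * self.n
        for tau, (R_row, C_row) in enumerate(zip(self.R, self.C), start=1):
            for i in range(self.n):
                raw[i] += weight(R_row[i], C_row[i]) / (tau * tau)
        total = sum(raw)  # strictly positive because weight(0, 0) > 0
        return [x / total for x in raw]

    def update(self, p, losses):
        player_loss = sum(pi * li for pi, li in zip(p, losses))
        # Register the sleeping experts that woke up this round.
        self.R.append([0.0] * self.n)
        self.C.append([0.0] * self.n)
        # Every awake sleeping expert (tau, i) sees the same regret r_{t,i}.
        for R_row, C_row in zip(self.R, self.C):
            for i, li in enumerate(losses):
                r = player_loss - li
                R_row[i] += r
                C_row[i] += abs(r)
        return player_loss
```

Because a fresh copy of each expert wakes up every round, the algorithm can recover after the best expert changes, without being told when the change happens.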

Tracking the best expert. In fact, AdaNormalHedge.TV is a solution for one of the open problems proposed
by Warmuth and Koolen [2014]. Adaptive regret immediately implies the so-called $K$-shifting regret
for the problem of tracking the best expert in a changing environment. Formally, define the $K$-shifting
regret $R_{K\text{-Shift}}$ to be $\max \sum_{k=1}^{K} R_{[t_{k-1}+1, t_k], i_k}$ where the max is taken over all $i_1, \ldots, i_K \in [N]$ and
$0 = t_0 < t_1 < \cdots < t_K = T$. In other words, the player is competing with the best $K$-partition of
the whole game and the best expert on each of these partitions. Let $L^*_{K\text{-Shift}} = \max \sum_{k=1}^{K} L_{[t_{k-1}+1, t_k], i_k}$
be the total loss of such best partition (that is, the max is taken over the same space), and similarly define
$\tilde{L}^*_{K\text{-Shift}} = \max \sum_{k=1}^{K} \tilde{L}_{[t_{k-1}+1, t_k], i_k} \le L^*_{K\text{-Shift}}$. Since essentially $R_{K\text{-Shift}}$ is just the sum of $K$
adaptive regrets, using the bounds discussed above and the Cauchy-Schwarz inequality, we conclude that
AdaNormalHedge.TV ensures $R_{K\text{-Shift}} = \hat{O}\left(\sqrt{K \tilde{L}^*_{K\text{-Shift}} \ln(NT)}\right)$. Also, if the loss vectors are generated
randomly on these $K$ intervals, each satisfying the condition stated in Theorem 2, then the regret is
$R_{K\text{-Shift}} = \hat{O}\left(\frac{K \ln(NT)}{\alpha}\right)$ in expectation (high probability bound is similar). These bounds are optimal up
to logarithmic factors [Hazan and Seshadhri, 2007]. This is exactly what was asked in Warmuth and Koolen
[2014]: whether there is an algorithm that can do optimally for both adversarial and stochastic losses in the
problem of tracking the best expert. AdaNormalHedge.TV achieves this goal without knowing $K$ or any
other information, while the solution provided by Sani et al. [2014] needs to know $K$, $L^*_{K\text{-Shift}}$ and $\alpha$ to get
the same adversarial bound and a worse stochastic bound of order $O(1/\alpha^2)$.
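
To make the definition of $K$-shifting regret concrete, the following Python sketch computes it by brute-force dynamic programming from a table of instantaneous regrets. The function name `k_shifting_regret` is hypothetical, and the routine is only an evaluation aid for small examples (it takes $O(K T^2 N)$ time); it is not part of AdaNormalHedge.TV, which never needs to know $K$.

```python
def k_shifting_regret(r, K):
    """Max over 0 = t_0 < ... < t_K = T and experts i_1..i_K of
    sum_k R_{[t_{k-1}+1, t_k], i_k}, where r[t][i] is the instantaneous
    regret to expert i on round t + 1. Requires 1 <= K <= len(r)."""
    T, N = len(r), len(r[0])
    # prefix[t][i] = sum of r[0..t-1][i], so interval sums are differences.
    prefix = [[0.0] * N]
    for row in r:
        prefix.append([prefix[-1][i] + row[i] for i in range(N)])
    NEG = float("-inf")
    # best[k][t] = best total regret using exactly k intervals covering rounds 1..t.
    best = [[NEG] * (T + 1) for _ in range(K + 1)]
    best[0][0] = 0.0
    for k in range(1, K + 1):
        for t in range(k, T + 1):
            for s in range(k - 1, t):
                if best[k - 1][s] == NEG:
                    continue
                # Best single expert on the interval [s + 1, t].
                seg = max(prefix[t][i] - prefix[s][i] for i in range(N))
                best[k][t] = max(best[k][t], best[k - 1][s] + seg)
    return best[K][T]

# Two experts; the better competitor switches after round 2.
regrets = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
one_interval = k_shifting_regret(regrets, 1)
two_intervals = k_shifting_regret(regrets, 2)
```

In this toy example, allowing one switch doubles the regret of the best comparator sequence, which illustrates why shifting regret is a stronger benchmark than regret to a single fixed expert.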

Comparison to previous work. For adaptive regret, the FLH algorithm by Hazan and Seshadhri [2007] 
treats any standard expert algorithm as a sleeping expert, and has an additive term $O(\sqrt{t_2 \ln t_2})$ in addition to
the base algorithm's regret (when no prior information is available), which adds up to a large $O(K\sqrt{T \ln T})$
term for K -shifting regret. Due to this extra additive regret, FLH also does not enjoy first order bounds nor 
small regret in the stochastic setting, even if the base algorithm that it builds on provides these guarantees. 
On the other hand, FLH was proposed to achieve adaptive regret for any general online convex optimization 
problem. We point out that using AdaNormalHedge as the master algorithm in their framework will give 
similar improvements as discussed here. 

Adapt-ML-Prod is not directly applicable here for the corresponding sleeping expert problem since the 
total number of experts is unknown. 

Another well-studied algorithm for this problem is "fixed share". Several works on fixed share for the simpler "log loss" setting were studied before [Herbster and Warmuth, 1998, Bousquet and Warmuth, 2003, Adamskiy et al., 2012, Koolen et al., 2012]. Cesa-Bianchi et al. [2012] studied a generalized fixed share algorithm for the bounded loss setting considered here. When $t_2 - t_1$ and $L_{[t_1,t_2],i}$ are known, their algorithm ensures $R_{[t_1,t_2],i} = O\big(\sqrt{L_{[t_1,t_2],i} \ln(N(t_2 - t_1))}\big)$ for adaptive regret, and when $K$, $T$ and $L^*_{K\text{-Shift}}$ are known, they have $R_{K\text{-Shift}} = O\big(\sqrt{K L^*_{K\text{-Shift}} \ln(NT/K)}\big)$. No better result is provided for the stochastic setting. More importantly, when no prior information is known, which is the case in practice, the best results one can extract from their analysis are $R_{[t_1,t_2],i} = O\big(\sqrt{t_2 \ln(N t_2)}\big)$ and $R_{K\text{-Shift}} = O\big(K\sqrt{T \ln(NT)}\big)$, which are much worse than our results.

Table 1: Comparison of different algorithms for time-varying competitors (see footnote 7)

| Algorithm | $R_{[t_1,t_2],i}$ | $R_{K\text{-Shift}}$ | $R(\mathbf{u}_{1:T})$ |
|---|---|---|---|
| AdaNormalHedge.TV (this work) | $\sqrt{\tilde{L}_{[t_1,t_2],i} \ln(N t_2)}$ | $\sqrt{K \tilde{L}^*_{K\text{-Shift}} \ln(NT)}$ | $\sqrt{V(\mathbf{u}_{1:T}) \tilde{L}(\mathbf{u}_{1:T}) \ln(NT)}$ |
| FLH (see footnote 8) [Hazan and Seshadhri, 2007] | $\sqrt{t_2 \ln t_2}$ | $K\sqrt{T \ln T}$ | unknown |
| Fixed Share [Cesa-Bianchi et al., 2012] | $\sqrt{t_2 \ln(N t_2)}$ | $K\sqrt{T \ln(NT)}$ | unknown |

7 For fair comparison, we only consider the case when no prior information (e.g. $t_1$, $t_2$, $V(\mathbf{u}_{1:T})$ etc.) is known.

5.2 General Time-Varying Competitors 

We finally discuss the most general goal: compete with different $\mathbf{u}_t$ on different rounds. Recall $R(\mathbf{u}_{1:T}) = \sum_{t=1}^{T} \mathbf{u}_t \cdot \mathbf{r}_t$, where $\mathbf{u}_t \in \mathbb{R}^N_+$ for all $t$ (note that $\mathbf{u}_t$ does not even have to be a distribution). Clearly, adaptive regret and $K$-shifting regret are special cases of this general notion. Intuitively, how large this regret is should be closely related to how much the competitor's sequence varies. Cesa-Bianchi et al. [2012] introduced a distance measurement to capture this variation: $V(\mathbf{u}_{1:T}) = \sum_{t=1}^{T} \sum_{i=1}^{N} [u_{t,i} - u_{t-1,i}]_+$, where we define $u_{0,i} = 0$ for all $i$. Also let $\|\mathbf{u}_{1:T}\| = \sum_{t=1}^{T} \|\mathbf{u}_t\|_1$ and $L(\mathbf{u}_{1:T}) = \sum_{t=1}^{T} \mathbf{u}_t \cdot \boldsymbol{\ell}_t$. Fixed share is shown to ensure the following regret [Cesa-Bianchi et al., 2012]: $R(\mathbf{u}_{1:T}) = O\big(\sqrt{V(\mathbf{u}_{1:T}) L(\mathbf{u}_{1:T}) \ln\big(N \|\mathbf{u}_{1:T}\| / V(\mathbf{u}_{1:T})\big)}\big)$ when $V(\mathbf{u}_{1:T})$, $L(\mathbf{u}_{1:T})$ and $\|\mathbf{u}_{1:T}\|$ are known. No result was provided otherwise (see footnote 9). Here, we show that our parameter-free algorithm AdaNormalHedge.TV actually achieves almost the same bound without knowing any information beforehand. Moreover, while the results in Cesa-Bianchi et al. [2012] are specific to the fixed share algorithm, we prove the following results, which are independent of the concrete algorithms and may be of independent interest.

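Theorem 4. Suppose an expert algorithm ensures $R_{[t_1,t_2],i} \le \sqrt{A \sum_{t=t_1}^{t_2} z_{t,i}}$ for any $t_1$, $t_2$ and $i$, where $z_{t,i} \ge 0$ can be anything depending on $t$ and $i$ (e.g. $|r_{t,i}|$, $[\ell_{t,i} - \hat{\ell}_t]_+$, $\ell_{t,i}$ or constant 1), and $A$ is a term independent of $t_1$, $t_2$ and $i$. Then this algorithm also ensures

$$R(\mathbf{u}_{1:T}) \le \sqrt{A \, V(\mathbf{u}_{1:T}) \sum_{t=1}^{T} \mathbf{u}_t \cdot \mathbf{z}_t}.$$

Specially, for AdaNormalHedge.TV, plugging in $A = O(\ln(NT))$ and $z_{t,i} = [\ell_{t,i} - \hat{\ell}_t]_+$ gives $R(\mathbf{u}_{1:T}) = \hat{O}\big(\sqrt{V(\mathbf{u}_{1:T}) \tilde{L}(\mathbf{u}_{1:T}) \ln(NT)}\big)$, where $\tilde{L}(\mathbf{u}_{1:T}) = \sum_{t=1}^{T} \sum_{i=1}^{N} u_{t,i} [\ell_{t,i} - \hat{\ell}_t]_+$.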

The key idea of the proof is to rewrite $R(\mathbf{u}_{1:T})$ as a weighted sum of several adaptive regrets in an optimal way (see Appendix C for the complete proof). This theorem tells us that while playing with time-varying competitors seems to be a harder problem, it is in fact not any harder than its special case: achieving adaptive regret on any interval. Although the result is independent of the algorithms, one still cannot derive bounds on $R(\mathbf{u}_{1:T})$ for FLH or fixed share based on their adaptive regret bounds, because when no prior information is available, the bounds on $R_{[t_1,t_2],i}$ for these algorithms are of order $O(\sqrt{t_2})$ instead of $O(\sqrt{t_2 - t_1})$, which is not good enough. We refer the reader to Table 1 for a summary of this section.
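As a small illustration of the variation measure $V(\mathbf{u}_{1:T})$, the sketch below computes it for a competitor sequence and builds the two special cases mentioned above: a single expert on one interval (adaptive regret) and a sequence of experts on consecutive segments ($K$-shifting regret). The helper names `variation`, `one_hot`, `interval_competitor` and `shifting_competitor` are hypothetical and chosen only for this example.

```python
def variation(u_seq):
    """V(u_{1:T}) = sum_t sum_i [u_{t,i} - u_{t-1,i}]_+ with u_0 = 0."""
    n = len(u_seq[0])
    prev = [0.0] * n
    total = 0.0
    for u in u_seq:
        total += sum(max(a - b, 0.0) for a, b in zip(u, prev))
        prev = u
    return total


def one_hot(i, n):
    """Indicator vector of expert i among n experts."""
    return [1.0 if j == i else 0.0 for j in range(n)]


def interval_competitor(i, t1, t2, T, n):
    """Competitor for adaptive regret: expert i on rounds t1..t2, zero elsewhere."""
    return [one_hot(i, n) if t1 <= t <= t2 else [0.0] * n
            for t in range(1, T + 1)]


def shifting_competitor(segments, n):
    """Competitor for K-shifting regret; segments is a list of (expert, length)."""
    return [one_hot(i, n) for i, length in segments for _ in range(length)]
```

An interval competitor has variation 1, and a competitor that uses a different expert on each of $K$ consecutive segments has variation $K$, so in these cases the bound of Theorem 4 depends on the number of intervals rather than on $T$.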

8 The choice of the base algorithm does not matter since the dominating term in the regret comes from FLH itself.

9 Although in Section 7.3 of Cesa-Bianchi et al. [2012] the authors mention an online tuning technique for the parameters, it only works for special cases (e.g. adaptive regret).

6 Competing with the Best Pruning Tree


We now turn to our second application on predicting almost as well as the best pruning tree within a template 
tree. This problem was studied in the context of online learning by Helmbold and Schapire [1997] using 
the approach of Willems et al. [1993, 1995]. It is also called the tree expert problem in Cesa-Bianchi and 
Lugosi [2006]. Freund et al. [1997] proposed a generic reduction from a tree expert problem to a sleeping 
expert problem. Using this reduction with our new algorithm, we provide much better results compared to 
previous work (summarized in Table 2). 

Specifically, consider a setting where on each round $t$, the predictor has to make a decision $y_t$ from some convex set $\mathcal{Y}$ given some side information $x_t$, and then a convex loss function $f_t : \mathcal{Y} \to [0,1]$ is revealed and the player suffers loss $f_t(y_t)$. The predictor is given a template tree $\mathcal{G}$ to consult. Starting from the root, each node of $\mathcal{G}$ performs a test on $x_t$ to decide which of its children should perform the next test, until a leaf is reached. In addition to a test, each node (except the root) also makes a prediction based on $x_t$. A pruning tree $\mathcal{P}$ is a tree induced by replacing zero or more nodes (and associated subtrees) of $\mathcal{G}$ by leaves. The prediction of a pruning tree $\mathcal{P}$ given $x_t$, denoted by $\mathcal{P}(x_t)$, is naturally defined as the prediction of the leaf that $x_t$ reaches by traversing $\mathcal{P}$. The player's goal is thus to predict almost as well as the best pruning tree in hindsight, that is, to minimize $R_{\mathcal{G}} = \sum_{t=1}^{T} f_t(y_t) - \min_{\mathcal{P}} \sum_{t=1}^{T} f_t(\mathcal{P}(x_t))$.

The idea of the reduction introduced by Freund et al. [1997] is to view each edge of $\mathcal{G}$ as a sleeping expert (indexed by $e$), who is awake only when traversed by $x_t$, and in that case predicts $y_{t,e}$, the same prediction as the child node that it connects to. The predictor runs a sleeping expert algorithm with loss $\ell_{t,e} = f_t(y_{t,e})$, and eventually predicts $y_t = \sum_{e \in E} p_{t,e}\, y_{t,e}$, where $E$ denotes the set of edges of $\mathcal{G}$ and $p_{t,e}$ ($e \in E$) is the output of the expert algorithm; thus by convexity of $f_t$, we have $f_t(y_t) \le \sum_{e \in E} p_{t,e} f_t(y_{t,e}) = \hat{\ell}_t$. Note that we only care about the predictions of those awake experts since otherwise $p_{t,e}$ is required to be zero. Now let $\mathcal{P}^*$ be one of the best pruning trees, that is, $\mathcal{P}^* \in \arg\min_{\mathcal{P}} \sum_{t=1}^{T} f_t(\mathcal{P}(x_t))$, and $m$ be the number of leaves of $\mathcal{P}^*$. In the expert problem, we will set the competitor $\mathbf{u}$ to be a uniform distribution over the $m$ terminal edges (that is, the ones connecting the leaves) of $\mathcal{P}^*$, and the prior $\mathbf{q}$ to be a uniform distribution over all the edges. Since on each round, one and only one of those $m$ experts is awake, and its prediction is exactly $\mathcal{P}^*(x_t)$, we have $R(\mathbf{u}) = \sum_{t=1}^{T} \frac{1}{m}\big(\hat{\ell}_t - f_t(\mathcal{P}^*(x_t))\big)$, and therefore $R_{\mathcal{G}} \le m R(\mathbf{u})$.
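The reduction can be sketched in a few lines. The code below is a minimal illustration under assumptions of our own: the `Node` class, real-valued side information, and the functions `awake_edges` and `combine` are hypothetical names, and the weights passed to `combine` stand in for the $p_{t,e}$ produced by whatever sleeping expert algorithm is used.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Tuple

Edge = Tuple[str, str]


@dataclass
class Node:
    name: str
    predict: Optional[Callable[[float], float]] = None  # None at the root
    test: Optional[Callable[[float], int]] = None        # None at a leaf
    children: List["Node"] = field(default_factory=list)


def awake_edges(root: Node, x: float) -> List[Tuple[Edge, float]]:
    """Edges traversed by x from the root to a leaf, each paired with the
    prediction of the child node it leads to (the awake sleeping experts)."""
    path = []
    node = root
    while node.children:
        child = node.children[node.test(x)]
        path.append(((node.name, child.name), child.predict(x)))
        node = child
    return path


def combine(path: List[Tuple[Edge, float]], weights: Dict[Edge, float]) -> float:
    """Convex combination of the awake experts' predictions.

    Weights of sleeping edges are ignored; the awake weights are renormalised
    defensively, falling back to uniform if they are all zero."""
    total = sum(weights.get(e, 0.0) for e, _ in path)
    if total <= 0:
        return sum(y for _, y in path) / len(path)
    return sum(weights.get(e, 0.0) / total * y for e, y in path)
```

Only the edges on the current root-to-leaf path are touched on each round, which is why the per-round running time scales with the number of traversed edges.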

It remains to pick a concrete sleeping algorithm to apply. There are two reasons that make AdaNormalHedge very suitable for this problem. First, since $m$ is clearly unknown ahead of time, we are competing with an unknown competitor $\mathbf{u}$, which is exactly what AdaNormalHedge can deal with. Second, the number of awake experts is dynamically changing, and as discussed before, in this case AdaNormalHedge enjoys a regret bound that is adaptive in the number of experts seen so far. Formally, recall the notation $N_t$, which in this case represents the total number of distinct traversed edges up to round $t$. Then by Theorem 3, we have

$$R_{\mathcal{G}} = \hat{O}\Big(m\sqrt{(\mathbf{u} \cdot \mathbf{C}_T)\, \mathrm{RE}(\mathbf{u} \,\|\, \mathbf{q})}\Big) = \hat{O}\left(\sqrt{m \Big(\sum_{t=1}^{T} \big|\hat{\ell}_t - f_t(\mathcal{P}^*(x_t))\big|\Big) \ln\Big(\frac{N_T}{m}\Big)}\right),$$

which, by Theorem 2, implies $R_{\mathcal{G}} = \hat{O}\big(\sqrt{m \tilde{L}^* \ln(N_T/m)}\big)$, where $\tilde{L}^* = \sum_{t=1}^{T} [f_t(\mathcal{P}^*(x_t)) - \hat{\ell}_t]_+$, which is at most the total loss of the best pruning tree $L^* = \sum_{t=1}^{T} f_t(\mathcal{P}^*(x_t))$. Moreover, the algorithm is efficient: the overall space requirement is $O(N_T)$, and the running time on round $t$ is $O(\|x_t\|_{\mathcal{G}})$, where we use $\|x_t\|_{\mathcal{G}}$ to denote the number of edges that $x_t$ traverses.

Comparison to other solutions. The work by Freund et al. [1997] considers a variant of the exponential weights algorithm in a "log loss" setting, and is not directly applicable here (specifically it is not clear how to tune the learning rate appropriately). A better choice is the Adapt-ML-Prod algorithm by Gaillard et al. [2014] (the version for the sleeping expert problem). However, there is still one issue for this algorithm: it does not give a bound in terms of $\mathrm{RE}(\mathbf{u} \,\|\, \mathbf{q})$ for an unknown $\mathbf{u}$ (see footnote 10). So to get a bound on $R(\mathbf{u})$, the best thing to do is to use the definition $R(\mathbf{u}) = \mathbf{u} \cdot \mathbf{R}_T$ and a bound on each $R_{T,i}$. In short, one can verify that Adapt-ML-Prod ensures regret $R(\mathbf{u}) = \hat{O}\big(\tfrac{1}{m}\sqrt{m \tilde{L}^* \ln N}\big)$, where $N = |E|$ is the total number of edges/experts. We emphasize that $N$ can be much larger than $N_T$ when the tree is huge. Indeed, while $N_T$ is at most $T$ times the depth of $\mathcal{G}$, $N$ could be exponentially large in the depth. The running time and space of Adapt-ML-Prod for this problem, however, are the same as those of AdaNormalHedge.

Table 2: Comparison of different algorithms for the tree expert problem

| Algorithm | $R_{\mathcal{G}}$ | Time (per round) | Space (overall) | Need to know $L^*$ and $m$? |
|---|---|---|---|---|
| AdaNormalHedge (this work) | $\hat{O}\big(\sqrt{m \tilde{L}^* \ln(N_T/m)}\big)$ | $O(\|x_t\|_{\mathcal{G}})$ | $O(N_T)$ | No |
| Adapt-ML-Prod [Gaillard et al., 2014] | $\hat{O}\big(\sqrt{m \tilde{L}^* \ln N}\big)$ | $O(\|x_t\|_{\mathcal{G}})$ | $O(N_T)$ | No |
| Exponential Weights [Helmbold and Schapire, 1997] | $O\big(\sqrt{m L^*}\big)$ | $O(d\|x_t\|_{\mathcal{G}})$ | $O(N_T)$ | Yes |

We finally compare with a totally different approach [Helmbold and Schapire, 1997], where one simply treats each pruning tree as an expert and runs the exponential weights algorithm. Clearly the number of experts is exponentially large, and thus the running time and space are unacceptable for a naive implementation. This issue is avoided by using a clever dynamic programming technique. If $L^*$ and $m$ are known ahead of time, then the regret for this algorithm is $O(\sqrt{m L^*})$ by tuning the learning rate optimally. As discussed in Freund et al. [1997], the linear dependence on $m$ in this bound is much better than the one of the form $O\big(\sqrt{m L^* \ln N}\big)$ which, in the worst case, is linear in $m$. This was considered as the main drawback of using the sleeping expert approach. However, the bound for AdaNormalHedge is $\hat{O}\big(\sqrt{m \tilde{L}^* \ln(N_T/m)}\big)$, which is much smaller as discussed previously and in fact comparable to $O(\sqrt{m L^*})$. More importantly, $L^*$ and $m$ are unknown in practice. In this case, no sublinear regret is known for this dynamic programming approach, since it relies heavily on the fact that the algorithm is using a fixed learning rate and thus the usual time-varying learning rate methods cannot be applied here. Therefore, although theoretically this approach gives small regret, it is not a practical method. The running time is also slightly worse than the sleeping expert approach. For simplicity, suppose every internal node has $d$ children. Then the time complexity per round is $O(d\|x_t\|_{\mathcal{G}})$. The overall space requirement is $O(N_T)$, the same as other approaches. Again, see Table 2 for a summary of this section.

Finally, as mentioned in Freund et al. [1997], the sleeping expert approach can be easily generalized to predicting with a decision graph. In that case, AdaNormalHedge still enjoys all the improvements discussed in this section (details omitted).

10 In fact, even if it does, this term is still dominated by a $\ln N$ term. See the discussion at the end of Section A.3 of Gaillard et al. [2014] that we already mentioned in Section 3.

A Complete proofs of Theorem 1 and 3

We need the following two lemmas. The first one is an improved version of Lemma 2 of Luo and Schapire [2014].


Lemma 1. For any $R \in \mathbb{R}$, $C \ge 0$ and $r \in [-1, 1]$, we have

$$\Phi(R + r, C + |r|) \le \Phi(R, C) + w(R, C)\, r + \frac{3|r|}{2(C+1)}.$$

Proof. We first argue that $\Phi(R + r, C + |r|)$, as a function of $r$, is piecewise-convex on $[-1, 0]$ and $[0, 1]$. Since the value of the function is 1 when $R + r < 0$ and is at least 1 otherwise, it suffices to only consider the case when $R + r \ge 0$. On the interval $[0, 1]$, we can rewrite the exponent (ignoring the constant $\frac{1}{3}$) as:

$$\frac{(R + r)^2}{C + r} = (C + r) + \frac{(R - C)^2}{C + r} + 2(R - C),$$

which is convex in $r$. Combining with the fact that "if $g(x)$ is convex then $\exp(g(x))$ is also convex" proves that $\Phi(R + r, C + |r|)$ is convex on $[0, 1]$. Similarly when $r \in [-1, 0]$, rewriting the exponent as

$$\frac{(R + r)^2}{C - r} = (C - r) + \frac{(R + C)^2}{C - r} - 2(R + C)$$

completes the argument.

Now define function $f(r) = \Phi(R + r, C + |r|) - w(R, C)\, r$. Since $f(r)$ is clearly also piecewise-convex on $[-1, 0]$ and $[0, 1]$, we know that the curve of $f(r)$ is below the segment connecting points $(-1, f(-1))$ and $(0, f(0))$ on $[-1, 0]$, and also below the segment connecting points $(1, f(1))$ and $(0, f(0))$ on $[0, 1]$. This can be mathematically expressed as:

$$f(r) \le \max\{f(0) + (f(0) - f(-1))\, r,\; f(0) + (f(1) - f(0))\, r\} = f(0) + (f(1) - f(0))\,|r|,$$

where we use the fact $f(-1) = f(1)$. Now by Lemma 2 of Luo and Schapire [2014], we have

$$f(1) - f(0) = \frac{1}{2}\big(\Phi(R + 1, C + 1) + \Phi(R - 1, C + 1)\big) - \Phi(R, C) \le \frac{1}{2}\left(\exp\Big(\frac{4}{3(C+1)}\Big) - 1\right),$$

which is at most $\frac{e^{4/3} - 1}{2(C+1)}$ since $C$ is nonnegative and $e^x - 1 \le \frac{e^a - 1}{a}\, x$ for any $x \in [0, a]$. Noting that $e^{4/3} - 1 \le 3$ completes the proof. □
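A quick numerical sanity check of Lemma 1 can be run on a grid. The sketch below assumes the definitions from the main text that are not restated in this appendix, namely $\Phi(R, C) = \exp([R]_+^2 / (3C))$ with $\Phi(\cdot, 0) = 1$ and $w(R, C) = \frac{1}{2}(\Phi(R+1, C+1) - \Phi(R-1, C+1))$; the function names and the grid (restricted to $|R| \le C$, the range that occurs when $R$ and $C$ are cumulative sums of $r$ and $|r|$) are choices made for this example only.

```python
import math


def phi(R, C):
    """Assumed potential exp([R]_+^2 / (3C)), with the convention phi(., 0) = 1."""
    if C <= 0 or R <= 0:
        return 1.0
    return math.exp(R * R / (3.0 * C))


def w(R, C):
    """Assumed weight function: half the difference of the two shifted potentials."""
    return 0.5 * (phi(R + 1, C + 1) - phi(R - 1, C + 1))


def lemma1_gap(R, C, r):
    """Right-hand side minus left-hand side of Lemma 1 (nonnegative if it holds)."""
    lhs = phi(R + r, C + abs(r))
    rhs = phi(R, C) + w(R, C) * r + 3.0 * abs(r) / (2.0 * (C + 1.0))
    return rhs - lhs


def check_grid(max_c=20, steps=20):
    """Smallest gap over a grid of C in {0..max_c}, R in [-C, C], r in [-1, 1]."""
    worst = float("inf")
    for c in range(max_c + 1):
        for k in range(-steps, steps + 1):
            R = c * k / steps
            for j in range(-steps, steps + 1):
                worst = min(worst, lemma1_gap(R, float(c), j / steps))
    return worst
```

A grid check of this kind cannot replace the proof, but it is a cheap guard against transcription errors in the constants.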


The second lemma makes use of Lemma 1 to show that the weighted sum of potentials does not increase 
much and thus the final potential is relatively small. 

Lemma 2. AdaNormalHedge ensures $\sum_{i=1}^{N} q_i \Phi(R_{T,i}, C_{T,i}) \le B = 1 + \frac{3}{2} \sum_{i=1}^{N} q_i \big(1 + \ln(1 + C_{T,i})\big)$.

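Proof. First note that since AdaNormalHedge predicts $p_{t,i} \propto q_i\, w(R_{t-1,i}, C_{t-1,i})$, we have

$$\sum_{i=1}^{N} q_i\, w(R_{t-1,i}, C_{t-1,i})\, r_{t,i} = 0. \qquad (4)$$

Now applying Lemma 1 with $R = R_{t-1,i}$, $C = C_{t-1,i}$ and $r = r_{t,i}$, multiplying the inequality by $q_i$ on both sides and summing over $i$ gives $\sum_{i=1}^{N} q_i \Phi(R_{t,i}, C_{t,i}) \le \sum_{i=1}^{N} q_i \Phi(R_{t-1,i}, C_{t-1,i}) + \frac{3}{2} \sum_{i=1}^{N} q_i \frac{|r_{t,i}|}{C_{t-1,i} + 1}$. We then sum over $t \in [T]$ and telescope to show $\sum_{i=1}^{N} q_i \Phi(R_{T,i}, C_{T,i}) \le 1 + \frac{3}{2} \sum_{i=1}^{N} q_i \sum_{t=1}^{T} \frac{|r_{t,i}|}{C_{t-1,i} + 1}$. Finally applying Lemma 14 of Gaillard et al. [2014] to show $\sum_{t=1}^{T} \frac{|r_{t,i}|}{C_{t-1,i} + 1} \le 1 + \ln(1 + C_{T,i})$ completes the proof. □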
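The prediction rule used in this proof and the potential bound of Lemma 2 can be exercised in a short simulation. The sketch below assumes the definitions from the main text: instantaneous regret $r_{t,i} = \hat{\ell}_t - \ell_{t,i}$ with $\hat{\ell}_t = \mathbf{p}_t \cdot \boldsymbol{\ell}_t$, $R_{t,i} = \sum_{\tau \le t} r_{\tau,i}$, $C_{t,i} = \sum_{\tau \le t} |r_{\tau,i}|$, and the potential $\Phi(R, C) = \exp([R]_+^2/(3C))$. The function names and the fallback to the prior when all weights vanish are choices made for this example.

```python
import math
import random


def phi(R, C):
    """Assumed potential exp([R]_+^2 / (3C)), with phi(., 0) = 1."""
    if C <= 0 or R <= 0:
        return 1.0
    return math.exp(R * R / (3.0 * C))


def w(R, C):
    """Assumed weight function of AdaNormalHedge."""
    return 0.5 * (phi(R + 1, C + 1) - phi(R - 1, C + 1))


def ada_normal_hedge(losses, prior):
    """Play p_{t,i} proportional to q_i * w(R_{t-1,i}, C_{t-1,i}) on a loss matrix.

    losses[t][i] in [0, 1]; returns final R, final C and the played distributions."""
    n = len(prior)
    R = [0.0] * n
    C = [0.0] * n
    played = []
    for loss in losses:
        scores = [q * w(Ri, Ci) for q, Ri, Ci in zip(prior, R, C)]
        total = sum(scores)
        # If every weight is zero, any distribution keeps Eq. (4) valid; use the prior.
        p = [s / total for s in scores] if total > 0 else list(prior)
        player_loss = sum(pi * li for pi, li in zip(p, loss))
        for i in range(n):
            r = player_loss - loss[i]
            R[i] += r
            C[i] += abs(r)
        played.append(p)
    return R, C, played


def potential_and_bound(R, C, prior):
    """Left-hand side of Lemma 2 and the bound B."""
    pot = sum(q * phi(Ri, Ci) for q, Ri, Ci in zip(prior, R, C))
    B = 1.0 + 1.5 * sum(q * (1.0 + math.log(1.0 + Ci)) for q, Ci in zip(prior, C))
    return pot, B
```

Note that no learning rate or horizon appears anywhere in the update, which is the sense in which the algorithm is parameter-free.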
We are now ready to prove Theorem 1 and Theorem 3.


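Proof (of Theorem 1). Assume $q_1 \Phi(R_{T,1}, C_{T,1}) \ge \cdots \ge q_N \Phi(R_{T,N}, C_{T,N})$ without loss of generality. Then by Lemma 2, it must be true that $q_i \Phi(R_{T,i}, C_{T,i}) \le \frac{B}{i}$ for all $i$, which, by solving for $R_{T,i}$, gives $R_{T,i} \le \sqrt{3 C_{T,i} \ln\big(\frac{B}{i q_i}\big)}$. Multiplying both sides by $u_i$, summing over $i$ and applying the Cauchy-Schwarz inequality, we arrive at $R(\mathbf{u}) \le \sum_{i=1}^{N} \sqrt{3 u_i C_{T,i} \cdot u_i \ln\big(\frac{B}{i q_i}\big)} \le \sqrt{3 (\mathbf{u} \cdot \mathbf{C}_T)\big(D(\mathbf{u} \,\|\, \mathbf{q}) + \ln B\big)}$, where we define $D(\mathbf{u} \,\|\, \mathbf{q}) = \sum_{i=1}^{N} u_i \ln\big(\frac{1}{i q_i}\big)$. It remains to show that $D(\mathbf{u} \,\|\, \mathbf{q})$ and $\mathrm{RE}(\mathbf{u} \,\|\, \mathbf{q})$ are close. Indeed, we have $D(\mathbf{u} \,\|\, \mathbf{q}) - \mathrm{RE}(\mathbf{u} \,\|\, \mathbf{q}) = \sum_{i=1}^{N} u_i \ln\big(\frac{1}{i u_i}\big)$, which, by standard analysis, can be shown to reach its maximum when $u_i \propto \frac{1}{i}$ and the maximum value is $\ln \sum_{i=1}^{N} \frac{1}{i} \le \ln(1 + \ln N)$. This completes the proof for Eq. (1).

Finally, when $\mathbf{u}$ is in the special form as described in Theorem 1, we have $D(\mathbf{u} \,\|\, \mathbf{q}) - \mathrm{RE}(\mathbf{u} \,\|\, \mathbf{q}) = \ln|S| + \frac{1}{|S|} \sum_{i \in S} \ln\big(\frac{1}{i}\big) \le \ln|S| - \frac{1}{|S|} \ln(|S|!)$. By Stirling's formula $x! \ge \sqrt{2\pi x}\,\big(\frac{x}{e}\big)^x$, we arrive at $D(\mathbf{u} \,\|\, \mathbf{q}) - \mathrm{RE}(\mathbf{u} \,\|\, \mathbf{q}) \le 1 - \big(\ln\sqrt{2\pi|S|}\big)/|S| < 1$, proving Eq. (2). □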
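The two gap bounds used in this proof are easy to check numerically. The sketch below evaluates $D(\mathbf{u} \,\|\, \mathbf{q}) - \mathrm{RE}(\mathbf{u} \,\|\, \mathbf{q}) = \sum_i u_i \ln\frac{1}{i u_i}$, in which the prior cancels, and compares it with $\ln \sum_i \frac{1}{i}$; the function names are hypothetical and the experts are assumed to be indexed in the sorted order fixed at the start of the proof.

```python
import math


def gap(u):
    """D(u||q) - RE(u||q) = sum_i u_i * ln(1 / (i * u_i)), indices starting at 1."""
    return sum(ui * math.log(1.0 / (i * ui))
               for i, ui in enumerate(u, start=1) if ui > 0)


def harmonic(n):
    """n-th harmonic number H_n = 1 + 1/2 + ... + 1/n."""
    return sum(1.0 / i for i in range(1, n + 1))


def maximiser(n):
    """Distribution with u_i proportional to 1/i, where the gap is largest."""
    h = harmonic(n)
    return [1.0 / (i * h) for i in range(1, n + 1)]


def uniform_on_first(s, n):
    """Uniform distribution over the first s of n experts (worst case for Eq. (2))."""
    return [1.0 / s if i < s else 0.0 for i in range(n)]
```

For $N = 50$ the maximal gap is $\ln H_{50}$, roughly 1.5, which shows how small the price for not knowing the competitor is compared with $\ln N$.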


Proof (of Theorem 3). It suffices to point out that $r_{t,i}$ is still in the interval $[-1, 1]$ and Eq. (4) in the proof of Lemma 2 still holds by the new prediction rule Eq. (3). The entire proof for Theorem 1 applies here exactly. □

The algorithm and the proof can be generalized to $C_{t,i} = \sum_{\tau=1}^{t} |r_{\tau,i}|^d$ for any $d \in [0, 1]$. Indeed, the only extra work is to prove the convexity of $\Phi(R + r, C + |r|^d)$. When $d = 0$, we recover NormalHedge.DT exactly and get a bound on $R(\mathbf{u})$ for any $\mathbf{u}$ (in terms of $\sqrt{T}$), instead of just $R(\mathbf{u}^*)$ as in the original work. It is clear that $d = 1$ gives the smallest bound, which is why we use it in AdaNormalHedge. The ideal choice, however, should be $d = 2$ so that a second order bound similar to the one of Gaillard et al. [2014] can be obtained. Unfortunately, the function $\Phi(R + r, C + r^2)$ turns out to not always be piecewise-convex, which breaks our analysis. Whether $d = 2$ gives a low-regret algorithm and how to analyze it remain an open question.


B Proof of Theorem 2 

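Proof. For the first result, the key observation is $\mathbf{u} \cdot \mathbf{C}_T = R(\mathbf{u}) + 2\mathbf{u} \cdot \tilde{\mathbf{L}}_T$. We only consider the case when $R(\mathbf{u}) \ge 0$ since otherwise the statement is trivial. By the condition we thus have $R(\mathbf{u})^2 \le \big(R(\mathbf{u}) + 2\mathbf{u} \cdot \tilde{\mathbf{L}}_T\big) A(\mathbf{u})$, which by solving for $R(\mathbf{u})$ gives

$$R(\mathbf{u}) \le \frac{1}{2}\Big(A(\mathbf{u}) + \sqrt{A(\mathbf{u})^2 + 8(\mathbf{u} \cdot \tilde{\mathbf{L}}_T) A(\mathbf{u})}\Big) \le \sqrt{2(\mathbf{u} \cdot \tilde{\mathbf{L}}_T) A(\mathbf{u})} + A(\mathbf{u}),$$

proving the bound we want.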
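For readers who want the intermediate steps of the first result, the following derivation spells them out. It assumes the notation of the main text, $r_{t,i} = \hat{\ell}_t - \ell_{t,i}$ and $\tilde{L}_{T,i} = \sum_{t=1}^{T} [\ell_{t,i} - \hat{\ell}_t]_+$, and writes $A$ for $A(\mathbf{u})$ and $\tilde{L}$ for $\mathbf{u} \cdot \tilde{\mathbf{L}}_T$.

```latex
\begin{align*}
|r_{t,i}| &= r_{t,i} + 2[-r_{t,i}]_+ = r_{t,i} + 2[\ell_{t,i} - \hat{\ell}_t]_+
  && \text{(since } |x| = x + 2[-x]_+ \text{)} \\
C_{T,i} &= \sum_{t=1}^{T} |r_{t,i}| = R_{T,i} + 2\tilde{L}_{T,i}
  && \text{(sum over } t \text{, then take the } \mathbf{u}\text{-weighted sum over } i \text{)} \\
R(\mathbf{u})^2 - A\,R(\mathbf{u}) - 2\tilde{L}A &\le 0
  \;\Longrightarrow\;
  R(\mathbf{u}) \le \tfrac{1}{2}\Big(A + \sqrt{A^2 + 8\tilde{L}A}\Big)
  && \text{(larger root of the quadratic)} \\
\tfrac{1}{2}\Big(A + \sqrt{A^2 + 8\tilde{L}A}\Big)
  &\le \tfrac{1}{2}\Big(A + A + \sqrt{8\tilde{L}A}\Big) = A + \sqrt{2\tilde{L}A}
  && \text{(since } \sqrt{a + b} \le \sqrt{a} + \sqrt{b} \text{)}
\end{align*}
```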

For the second result, let $\mathbb{E}_t$ denote the expectation conditioning on all the randomness up to round $t$. So by the condition, we have $\mathbb{E}_t[r_{t,i^*}] = \sum_{i=1}^{N} p_{t,i}\, \mathbb{E}_t[\ell_{t,i} - \ell_{t,i^*}] \ge \alpha(1 - p_{t,i^*})$, and thus $\mathbb{E}[R_{T,i^*}] \ge \alpha S$, where we define $S = \sum_{t=1}^{T} \mathbb{E}[1 - p_{t,i^*}]$. On the other hand, by convexity we also have $|r_{t,i^*}| \le \sum_{i=1}^{N} p_{t,i} |\ell_{t,i} - \ell_{t,i^*}| \le 1 - p_{t,i^*}$ and thus $\mathbb{E}[R_{T,i^*}] \le \mathbb{E}\big[\sqrt{A(\mathbf{e}_{i^*}) C_{T,i^*}}\big] \le \sqrt{A(\mathbf{e}_{i^*}) S}$ by the concavity of the square root function. Combining the above two statements gives $S \le \frac{A(\mathbf{e}_{i^*})}{\alpha^2}$, and plugging this back shows $\mathbb{E}[R_{T,i^*}] \le \frac{A(\mathbf{e}_{i^*})}{\alpha}$. The high probability statement follows from the exact same argument of Gaillard et al. [2014] using a martingale concentration lemma. □
C Proof of Theorem 4

Below we use $[s, t]$ to denote the set $\{s, s+1, \ldots, t-1, t\}$ for $1 \le s \le t \le T$ and $s, t \in \mathbb{N}_+$.

Proof. We first fix an expert $i$ and consider the regret to this expert $\sum_{t=1}^{T} u_{t,i}\, r_{t,i}$. Let $a_j > 0$ and $U_j = [s_j, t_j]$ for $j = 1, \ldots, M$ be $M$ positive numbers and $M$ corresponding time intervals such that $u_{t,i} = \sum_{j=1}^{M} a_j \mathbf{1}\{t \in U_j\}$. Note that this is always possible, with a trivial choice being $a_j = u_{j,i}$, $s_j = t_j = j$ and $M = T$; we will however need a more sophisticated construction specified later. By the adaptive regret guarantee, we have

$$\sum_{t=1}^{T} u_{t,i}\, r_{t,i} = \sum_{j=1}^{M} a_j \sum_{t=1}^{T} \mathbf{1}\{t \in U_j\}\, r_{t,i} = \sum_{j=1}^{M} a_j R_{[s_j, t_j], i} \le \sum_{j=1}^{M} a_j \sqrt{A \sum_{t=s_j}^{t_j} z_{t,i}},$$

which, by the Cauchy-Schwarz inequality, is at most

$$\sqrt{\sum_{j=1}^{M} a_j} \cdot \sqrt{A \sum_{j=1}^{M} a_j \sum_{t=s_j}^{t_j} z_{t,i}} = \sqrt{\sum_{j=1}^{M} a_j} \cdot \sqrt{A \sum_{j=1}^{M} a_j \sum_{t=1}^{T} \mathbf{1}\{t \in U_j\}\, z_{t,i}} = \sqrt{\sum_{j=1}^{M} a_j} \cdot \sqrt{A \sum_{t=1}^{T} u_{t,i}\, z_{t,i}}.$$

Therefore, we need a construction of $a_j$ and $U_j$ such that $\sum_{j=1}^{M} a_j$ is minimized. This is addressed in Lemma 3 below, which shows that there is in fact always an (optimal) construction such that $\sum_{j=1}^{M} a_j$ is exactly $\sum_{t=1}^{T} [u_{t,i} - u_{t-1,i}]_+$. Now summing the resulting bound over all experts and applying the Cauchy-Schwarz inequality again proves the theorem. □

Lemma 3. Let $v_1, \ldots, v_T$ be $T$ nonnegative numbers and $h(\{v_1, \ldots, v_T\}) = \min \sum_{j=1}^{M} a_j$, where the minimum is taken over the set of all possible choices of $M \in \mathbb{N}_+$, $a_j > 0$, $U_j = [s_j, t_j]$ with $1 \le s_j \le t_j \le T$, $s_j, t_j \in \mathbb{N}_+$ ($j = 1, \ldots, M$) such that $v_t = \sum_{j=1}^{M} a_j \mathbf{1}\{t \in U_j\}$ for all $t$. Then with $v_0$ defined to be 0 we have

$$h(\{v_1, \ldots, v_T\}) = \sum_{t=1}^{T} [v_t - v_{t-1}]_+.$$

Proof. We prove the lemma by induction on $T$. The base case $T = 1$ is trivial. Now assume the lemma is true for any number of $v$'s smaller than $T$. Suppose $a_j$, $U_j = [s_j, t_j]$ ($j = 1, \ldots, M$) is an optimal construction. Let $t^* \in \arg\min_t v_t$. Without loss of generality assume $t^*$ belongs and only belongs to $U_1, \ldots, U_k$ for some $k$. By the definition of $h$, we must have

$$h(\{v_1, \ldots, v_T\}) = v_{t^*} + h(\{v_1 - w_1, \ldots, v_{t^*-1} - w_{t^*-1}\}) + h(\{v_{t^*+1} - w_{t^*+1}, \ldots, v_T - w_T\}),$$

where we define $w_t = \sum_{j=1}^{k} a_j \mathbf{1}\{t \in U_j\}$ and $h(\emptyset) = 0$. (Note that we also use the fact that $v_{t^*} = \min_t v_t$ so that $v_t - w_t \ge v_t - v_{t^*}$ is always nonnegative as required.) Now the key idea is to show that "extending" each $U_j$ ($j \in [k]$) to the entire time interval $[1, T]$ does not increase the objective value in some sense. First consider extending $U_1$. By the inductive assumption, we have

$$\begin{aligned}
& h(\{v_1 - w_1 - a_1, \ldots, v_{s_1-1} - w_{s_1-1} - a_1,\; v_{s_1} - w_{s_1}, \ldots, v_{t^*-1} - w_{t^*-1}\}) \\
&= (v_1 - w_1 - a_1) + \sum_{t=2}^{s_1-1} \big[(v_t - w_t - a_1) - (v_{t-1} - w_{t-1} - a_1)\big]_+ \\
&\quad + \big[(v_{s_1} - w_{s_1}) - (v_{s_1-1} - w_{s_1-1} - a_1)\big]_+ + \sum_{t=s_1+1}^{t^*-1} \big[(v_t - w_t) - (v_{t-1} - w_{t-1})\big]_+ \\
&= h(\{v_1 - w_1, \ldots, v_{t^*-1} - w_{t^*-1}\}) + \big[(v_{s_1} - w_{s_1}) - (v_{s_1-1} - w_{s_1-1} - a_1)\big]_+ \\
&\quad - \Big(\big[(v_{s_1} - w_{s_1}) - (v_{s_1-1} - w_{s_1-1})\big]_+ + a_1\Big) \\
&\le h(\{v_1 - w_1, \ldots, v_{t^*-1} - w_{t^*-1}\}),
\end{aligned}$$

where the inequality follows from the fact $[b + c]_+ \le [b]_+ + c$ for $c \ge 0$. Similarly, we also have

$$h(\{v_{t^*+1} - w_{t^*+1}, \ldots, v_{t_1} - w_{t_1},\; v_{t_1+1} - w_{t_1+1} - a_1, \ldots, v_T - w_T - a_1\}) \le h(\{v_{t^*+1} - w_{t^*+1}, \ldots, v_T - w_T\}).$$

By extending $U_2, \ldots, U_k$ one by one in a similar way, we arrive at

$$h(\{v_1, \ldots, v_T\}) \ge v_{t^*} + h(\{v_1 - v_{t^*}, \ldots, v_{t^*-1} - v_{t^*}\}) + h(\{v_{t^*+1} - v_{t^*}, \ldots, v_T - v_{t^*}\}).$$

Notice that the right hand side of the above inequality admits a valid construction for $\{v_1, \ldots, v_T\}$: an interval $U = [1, T]$ with weight $a = v_{t^*}$, together with the optimal constructions for $\{v_1 - v_{t^*}, \ldots, v_{t^*-1} - v_{t^*}\}$ and $\{v_{t^*+1} - v_{t^*}, \ldots, v_T - v_{t^*}\}$. Therefore, by the optimality of $h$, the above inequality must be an equality. By using the inductive assumption again, we thus have

$$\begin{aligned}
h(\{v_1, \ldots, v_T\}) &= v_{t^*} + h(\{v_1 - v_{t^*}, \ldots, v_{t^*-1} - v_{t^*}\}) + h(\{v_{t^*+1} - v_{t^*}, \ldots, v_T - v_{t^*}\}) \\
&= v_{t^*} + \Big(v_1 - v_{t^*} + \sum_{t=2}^{t^*-1} [v_t - v_{t-1}]_+\Big) + \Big(v_{t^*+1} - v_{t^*} + \sum_{t=t^*+2}^{T} [v_t - v_{t-1}]_+\Big) \\
&= \Big(v_1 + \sum_{t=2}^{t^*-1} [v_t - v_{t-1}]_+\Big) + \Big(\sum_{t=t^*+1}^{T} [v_t - v_{t-1}]_+\Big) \\
&= \sum_{t=1}^{T} [v_t - v_{t-1}]_+,
\end{aligned}$$

where the last step follows from $[v_{t^*} - v_{t^*-1}]_+ = 0$ by the definition of $t^*$. This completes the proof. □
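The construction behind Lemma 3 is recursive and can be written down directly: cover the whole range with an interval whose weight is the minimum value, subtract it, and recurse on the parts to the left and right of a minimiser. The sketch below follows that recipe; the function names `decompose` and `positive_variation` are hypothetical, and intervals are reported as 1-indexed inclusive triples `(weight, s, t)`.

```python
def decompose(v, offset=1):
    """Write nonnegative v_1..v_T as a sum of weighted interval indicators.

    Returns a list of (weight, s, t) with 1-indexed inclusive endpoints; the
    recursion mirrors the proof: peel off the minimum, then split around it."""
    if not v:
        return []
    m = min(v)
    k = v.index(m)  # position of a minimiser t*
    pieces = [(m, offset, offset + len(v) - 1)] if m > 0 else []
    left = [x - m for x in v[:k]]
    right = [x - m for x in v[k + 1:]]
    return pieces + decompose(left, offset) + decompose(right, offset + k + 1)


def positive_variation(v):
    """sum_t [v_t - v_{t-1}]_+ with v_0 = 0, the value of h in Lemma 3."""
    prev, total = 0.0, 0.0
    for x in v:
        total += max(x - prev, 0.0)
        prev = x
    return total
```

For the sequence 2, 5, 3, 3, 0, 4, 1 the recursion produces five intervals with total weight 9, which equals the sum of the positive increments 2 + 3 + 4.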